Multivariate Categorical (Chap 7)

What does multivariate data look like?

Multiple variables but not a multivariate plot:

Source: https://www.cdc.gov/nchs/images/databriefs/201-250/db217_fig1.png

Multivariate categorical data

Frequency

Proportion / Association

Change of State

Two continuous variables

Two categorical variables

Age    Favorite
young  bubble gum
old    coffee
young  bubble gum
old    coffee
young  bubble gum
young  bubble gum
old    coffee
young  coffee
young  bubble gum
old    coffee
young  bubble gum
old    bubble gum
old    bubble gum
young  bubble gum
young bubble gum

Two categorical variables

Two discrete variables

(same problem)

Stacked bar chart

Grouped bar chart
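A minimal sketch of the two layouts in base R, using the Age and Favorite counts tabulated from the 14 responses listed earlier. The object names (`tab`, `stacked`, `grouped`) are illustrative only; `barplot()` stacks the rows of a matrix by default and places them side by side when `beside = TRUE`.

```r
# Counts of Age by Favorite, tabulated from the 14 responses listed earlier
tab <- matrix(c(2, 7, 4, 1), nrow = 2,
              dimnames = list(Age = c("old", "young"),
                              Favorite = c("bubble gum", "coffee")))

# barplot() draws one bar per column, so transpose to get one bar per age group
# Stacked: segments within each bar are the Favorite levels
stacked <- barplot(t(tab), legend.text = TRUE,
                   xlab = "Age", ylab = "Count", main = "Stacked bar chart")

# Grouped: the Favorite levels sit side by side within each age group
grouped <- barplot(t(tab), beside = TRUE, legend.text = TRUE,
                   xlab = "Age", ylab = "Count", main = "Grouped bar chart")
```

The stacked version makes group totals easy to compare; the grouped version makes it easier to compare the individual levels across groups.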

Grouped bar chart w/ facets

Three categorical variables

Cleveland dot plot (two variables)

Cleveland dot plot (three variables)

Proportion / Association

Are older Americans more interested in local news than younger Americans?

We ask 34,892 U.S. adults whether they follow local news “very closely”. 34.5% say yes.

Group sizes are:

Age Freq
18-29 2851
30-49 9967
50-64 11163
65+ 10911

Source: https://www.journalism.org/2019/08/14/methodology-local-news-demographics/

If older Americans are NOT more interested in local news, what would the breakdowns look like?

Who follows local news?

Age Followers Nonfollowers
18-29 984 1867
30-49 3439 6528
50-64 3851 7312
65+ 3764 7147
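The breakdown above can be reproduced with a short sketch: if age makes no difference, each age group gets the same 34.5% share of followers. The variable names (`sizes`, `overall`, `hypothetical`) are illustrative and not from the original slides.

```r
# Group sizes from the survey and the overall share who follow local news closely
sizes <- c("18-29" = 2851, "30-49" = 9967, "50-64" = 11163, "65+" = 10911)
overall <- 0.345

# With no association, every age group has the same share of followers
followers <- round(sizes * overall)
hypothetical <- cbind(Followers = followers,
                      Nonfollowers = sizes - followers)
hypothetical
```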

How can we graph this data?

Mosaic plot

Now let’s look at the actual data…

U.S. adults who closely follow local news

Chi Square Test of Independence

Null hypothesis: Age and tendency to follow local news are independent

Alternative hypothesis: Age and tendency to follow local news are NOT independent

We compare OBSERVED to EXPECTED:

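# Columns 2 and 3 of `local` hold the Followers and Nonfollowers counts
localmat <- as.matrix(local[,2:3])
rownames(localmat) <- local$Age
# correct = FALSE: no continuity correction (it only applies to 2 x 2 tables)
X <- chisq.test(localmat, correct = FALSE)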
X$observed
##       Followers Nonfollowers
## 18-29       428         2423
## 30-49      2791         7176
## 50-64      4242         6921
## 65+        4583         6328
X$expected
##       Followers Nonfollowers
## 18-29  984.1065     1866.893
## 30-49 3440.4032     6526.597
## 50-64 3853.2378     7309.762
## 65+   3766.2526     7144.747
X
## 
##  Pearson's Chi-squared test
## 
## data:  localmat
## X-squared = 997.48, df = 3, p-value < 0.00000000000000022
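To see where the expected counts and the test statistic come from, here is a sketch that recomputes them by hand from the observed counts printed above. The object names (`obs`, `expected`, `stat`) are illustrative; the formulas are the standard Pearson chi-squared ones.

```r
# Observed counts, copied from the X$observed output
obs <- matrix(c(428, 2791, 4242, 4583,
                2423, 7176, 6921, 6328),
              ncol = 2,
              dimnames = list(Age = c("18-29", "30-49", "50-64", "65+"),
                              Response = c("Followers", "Nonfollowers")))

# Expected count under independence = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
stat <- sum((obs - expected)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability from the chi-squared distribution
p_value <- pchisq(stat, df, lower.tail = FALSE)
```

The statistic and degrees of freedom should agree with the `chisq.test()` output above.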

Compare

Creating mosaic plots

Mosaic plots (one variable)

Mosaic plots (two variables)

Mosaic plots (two variables)

What if there were no relationship between the variables?

Mosaic plots (two variables)

What if there were a deterministic relationship between the variables?

Mosaic plots (three variables)

Mosaic plots (three variables, Observed vs. Expected)

Best practices

  • Dependent variable is split last and split horizontally

  • Fill is set to dependent variable

  • Other variables are split vertically

  • Most important level of dependent variable is closest to the x-axis and darkest (or most noticeable shade)
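A minimal sketch of these practices with `vcd::mosaic()`, using the observed local news counts. The table name `news` and the colors are arbitrary choices. The sketch assumes vcd draws the levels of a horizontal split from top to bottom, so the level of interest is listed last to place it nearest the x-axis.

```r
library(vcd)

# Observed counts; "Followers" is listed last so that it lands at the bottom
# (assumption: vcd draws the levels of a horizontal split top to bottom)
news <- as.table(matrix(c(2423, 7176, 6921, 6328,
                          428, 2791, 4242, 4583),
                        ncol = 2,
                        dimnames = list(Age = c("18-29", "30-49", "50-64", "65+"),
                                        Follows = c("Nonfollowers", "Followers"))))

# The left-hand side of the formula is the dependent variable: it is split last
# and used for the fill. Age is split vertically, Follows horizontally.
mosaic(Follows ~ Age, data = news,
       direction = c("v", "h"),
       highlighting_fill = c("grey85", "steelblue"))
```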

Mosaic pairs plot

Occupational Mobility data (UK only)

vcdExtra::Yamaguchi87

Father Son Freq
UpNM UpNM 474
UpNM LoNM 129
UpNM UpM 87
UpNM LoM 124
UpNM Farm 11
LoNM UpNM 300
LoNM LoNM 218
LoNM UpM 171
LoNM LoM 220
LoNM Farm 8
UpM UpNM 438
UpM LoNM 254
UpM UpM 669
UpM LoM 703
UpM Farm 16
LoM UpNM 601
LoM LoNM 388
LoM UpM 932
LoM LoM 1789
LoM Farm 37
Farm UpNM 76
Farm LoNM 56
Farm UpM 125
Farm LoM 295
Farm Farm 191

UK Mosaic

##       Son
## Father UpNM LoNM UpM  LoM Farm
##   UpNM  474  129  87  124   11
##   LoNM  300  218 171  220    8
##   UpM   438  254 669  703   16
##   LoM   601  388 932 1789   37
##   Farm   76   56 125  295  191
##       Son
## Father     UpNM      LoNM      UpM       LoM      Farm
##   UpNM 187.4910 103.72052 196.9201  310.7646  26.10383
##   LoNM 208.3991 115.28693 218.8797  345.4195  29.01480
##   UpM  472.7045 261.50144 496.4774  783.5034  65.81328
##   LoM  851.5499 471.07976 894.3754 1411.4361 118.55883
##   Farm 168.8555  93.41133 177.3474  279.8764  23.50926
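The second table above holds the counts expected if a son's occupation were independent of his father's. A sketch of that calculation follows, with the observed counts typed in from the first table so that it does not depend on vcdExtra being installed. The names `mobility`, `expected`, and `ratio` are illustrative.

```r
lv <- c("UpNM", "LoNM", "UpM", "LoM", "Farm")

# Observed Father x Son counts, copied from the table above
mobility <- matrix(c(474, 129,  87,  124,  11,
                     300, 218, 171,  220,   8,
                     438, 254, 669,  703,  16,
                     601, 388, 932, 1789,  37,
                      76,  56, 125,  295, 191),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(Father = lv, Son = lv))

# Expected count under independence = row total * column total / grand total
expected <- outer(rowSums(mobility), colSums(mobility)) / sum(mobility)
round(expected, 1)

# Observed / expected on the diagonal (son in the same category as father)
ratio <- diag(mobility) / diag(expected)
round(ratio, 2)
```

A ratio above 1 on the diagonal means more sons stay in their father's category than independence would predict.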

UK Mosaic (no mobility hypothetical)

Mosaic Mobility Plot

Plot names

mosaic plot = any filled rectangular plot (no white space) with consistent numbers of rows and columns, in which the area of each small rectangle is proportional to the frequency count for a unique combination of levels of the categorical variables displayed

Plot names: mosaic plot vs. tree map

mosaic plot = filled rectangular plot with consistent number of rows and columns, where each small rectangle represents a unique combination of levels of factors of the variables displayed

treemap = filled rectangular plot representing hierarchical data (fill color does not necessarily represent frequency count)

Mosaic plots: spine plot

spine plot = mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction

Mosaic plots

MASS::housing

Same bin width mosaic plots

= relative frequency stacked bar charts

p. 145
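One way to see the equivalence is to convert counts to within-group proportions and draw a stacked bar chart with equal bar widths. This sketch uses the Age by Favorite counts from this deck; the names `tab` and `props` are illustrative.

```r
# Age by Favorite counts
tab <- matrix(c(2, 7, 4, 1), nrow = 2,
              dimnames = list(Age = c("old", "young"),
                              Favorite = c("bubble gum", "coffee")))

# Proportions within each age group (each row sums to 1)
props <- prop.table(tab, margin = 1)

# Equal-width bars, one per age group, each filled to a height of 1
barplot(t(props), legend.text = TRUE,
        xlab = "Age", ylab = "Proportion")
```

A regular mosaic plot differs only in that the bar widths are proportional to the group sizes.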

Categorical data formats

cases

Name       Age    Favorite    Music
Emma       young  bubble gum  rock
Linda      old    coffee      classical
Emily      young  bubble gum  rock
Deborah    old    coffee      classical
Charlotte  young  bubble gum  rock
Olivia     young  bubble gum  classical
Barbara    old    coffee      rock
Sophia     young  coffee      classical
Ava        young  bubble gum  rock
Patricia   old    coffee      classical
Isabella   young  bubble gum  rock
Nancy      old    bubble gum  classical
Karen      old    bubble gum  rock
Harper     young  bubble gum  classical

counts

Age    Favorite    Freq
old    bubble gum  2
old    coffee      4
young  bubble gum  7
young  coffee      1

pivot table / contingency table

##        Favorite
## Age     bubble gum coffee
##   old            2      4
##   young          7      1

Conversions

http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/#counts-to-contingency-table

See MosaicCoding.R
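A short sketch of moving between the three formats in base R, using the Age and Favorite columns of the cases table above. The object names are illustrative, and this is not necessarily how MosaicCoding.R does it.

```r
# cases: one row per person
cases <- data.frame(
  Age = c("young", "old", "young", "old", "young", "young", "old",
          "young", "young", "old", "young", "old", "old", "young"),
  Favorite = c("bubble gum", "coffee", "bubble gum", "coffee", "bubble gum",
               "bubble gum", "coffee", "coffee", "bubble gum", "coffee",
               "bubble gum", "bubble gum", "bubble gum", "bubble gum")
)

# cases -> contingency table
tab <- table(cases)

# contingency table -> counts (columns Age, Favorite, Freq)
counts <- as.data.frame(tab)

# counts -> contingency table
tab2 <- xtabs(Freq ~ Age + Favorite, data = counts)

# counts -> cases: repeat each row Freq times
cases2 <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Age", "Favorite")]
```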

Likert data

Gender equality survey (Pew Research Center)

Stacked Bar Chart

Source: https://blog.datawrapper.de/divergingbars/

Diverging Stacked Bar Chart

Source: https://blog.datawrapper.de/divergingbars/

Diverging Stacked Bar Chart w/ Separate Neutrals

Source: https://blog.datawrapper.de/divergingbars/

Diverging stacked bar chart

Diverging stacked bar chart (neutrals removed)

Political Interests

Source: “Perceived losses of scientific integrity under the Trump administration: A survey of federal scientists”

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231929

Diverging stacked bar chart

HH package

Additional mosaic plot parameters: abbreviate labels

(Note: these examples are designed to highlight one parameter at a time. They should not be taken as complete examples as they do not follow all best practices.)

Abbreviate labels, one setting per variable in order of variable splits

labeling = labeling_border(abbreviate_labs = c(FALSE, 3, 6))

See ?vcd::labelings

Additional mosaic plot parameters: adjust spacing

Change spacing between factor levels, one setting per variable in order of variable splits

spacing = spacing_dimequal(c(.5, .1, 0))

See ?vcd::spacings

Additional mosaic plot parameters: adjust variable names

set_varnames inside labeling function

labeling = labeling_border(set_varnames = c(...))

Additional mosaic plot parameters: rotate labels

Change angle of displayed factor levels, one setting per side, starting with top (top, right, bottom, left)

rot_labels = c(0, 0, 0, 0)

If labeling = is included, rot_labels should go inside the labeling function, for example:

labeling = labeling_border(rot_labels = c(0, 0, 0, 0), ...)
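For reference, a sketch that combines the parameters from the preceding slides in a single call, using `MASS::housing`. The choice of variables, the split order, and the replacement variable names are assumptions made for illustration. As with the slides above, the settings are not tuned for a polished plot.

```r
library(vcd)

# Three-way table from MASS::housing; split order is Type, Infl, then Sat
housing_tab <- xtabs(Freq ~ Type + Infl + Sat, data = MASS::housing)

mosaic(Sat ~ Type + Infl, data = housing_tab,
       direction = c("v", "v", "h"),
       # spacing between levels, one value per variable in split order
       spacing = spacing_dimequal(c(.5, .1, 0)),
       labeling = labeling_border(
         # abbreviation length per variable in split order (FALSE = none)
         abbreviate_labs = c(FALSE, 3, 6),
         # label angles for the top, right, bottom, and left sides
         rot_labels = c(0, 0, 0, 0),
         # displayed variable names, matched by the original names
         set_varnames = c(Infl = "Influence", Sat = "Satisfaction")))
```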